Dataset Description
63,70,954 Words | 1,119 Titles |
XML format | 6 domains
Malayalam is a highly agglutinative and morphologically rich
language. Patterns of actual language use in natural texts provide
evidence of these traits. The Government of India set up the Linguistic Data
Consortium for Indian Languages (LDC-IL) to support those working in language
development. The LDC-IL Malayalam Text Corpus was developed with attention to
factors such as text quality, representativeness, retrievable
format, corpus size, and authenticity. To collect text,
LDC-IL follows a standard category list of domains and a predefined set of
criteria. The Malayalam texts can be broadly classified as literary
and non-literary. A large amount of literary text is available in
Malayalam, but scientific text is comparatively scarce, so LDC-IL attempts to
develop a balanced Malayalam text corpus. Data has been collected from books,
magazines, and newspapers, verified against the original texts, and then
stored.
The Malayalam Text Corpus is encoded in a machine-readable form and stored in a standard format: the text is encoded in Unicode and stored as XML, with metadata embedded in the data. The corpus was created from contemporary texts that were either typed or crawled. The LDC-IL Malayalam Text Corpus contains 63,70,954 words drawn from 1,119 different titles. The six major domains are Aesthetics, Commerce, Official Documents, Social Sciences, Mass Media, and Science & Technology.
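As a rough illustration of working with Unicode text stored in XML with embedded metadata, the following Python sketch reads one document and counts its words. The tag names (`doc`, `meta`, `title`, `domain`, `text`) and the sample content are hypothetical and are not taken from the LDC-IL schema; the whitespace-based word count is likewise an assumption and may differ from how LDC-IL counts words.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The element names are illustrative
# assumptions, NOT the actual LDC-IL corpus schema.
SAMPLE = """<doc>
  <meta>
    <title>Sample Title</title>
    <domain>Aesthetics</domain>
  </meta>
  <text>മലയാളം ഒരു ഭാഷയാണ്</text>
</doc>
"""


def read_document(xml_string):
    """Parse one XML document and return its metadata and a word count."""
    root = ET.fromstring(xml_string)
    body = root.findtext("text", default="")
    return {
        "title": root.findtext("meta/title", default=""),
        "domain": root.findtext("meta/domain", default=""),
        # Assumption: words are whitespace-separated tokens.
        "word_count": len(body.split()),
    }


info = read_document(SAMPLE)
print(info["domain"], info["word_count"])
```

For real corpus files, the element paths would need to be adjusted to match the structure described in the corpus documentation.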
The available Text Corpus details:
Domains | Words | Percentage of Total Corpus
Aesthetics | 25,77,090 | 40.45 %
Commerce | 3,13,135 | 4.92 %
Official Documents | 7,733 | 0.12 %
Mass Media | 21,35,621 | 13.74 %
Science and Technology | 16,79,511 | 33.52 %
Social Sciences | 8,75,568 | 7.25 %
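The word counts on this page use Indian digit grouping (for example, 63,70,954 is the same number as the 6370954 given under Word Count). The short Python sketch below shows one way to convert such strings to integers and to compute a domain's share of the total. The function names are illustrative, not part of any LDC-IL tooling.

```python
def parse_indian_number(text):
    """Convert a string with Indian digit grouping, e.g. '63,70,954', to an int."""
    return int(text.replace(",", "").replace(" ", ""))


def share_of_total(words, total):
    """Return the percentage share of `words` in `total`, rounded to 2 decimals."""
    return round(100 * words / total, 2)


total = parse_indian_number("63,70,954")
aesthetics = parse_indian_number("25,77,090")

print(total)                              # 6370954
print(share_of_total(aesthetics, total))  # 40.45
```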
A detailed explanation of the Malayalam Text Corpus will be available
in the Malayalam Raw Text Corpus Documentation.
For research use, please cite the following:
- Ramamoorthy, L., Narayan Choudhary, Saritha S.L., Rejitha K.S. & Sajila S. 2019. A Gold Standard Malayalam Raw Text Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
Item specifics
- Authors: Ramamoorthy L., Narayan Choudhary, Saritha S.L., Rejitha K.S., Sajila S.
- Corpus Type: Raw Corpus
- Catalogue Number: 1141
- ISBN: 978-81-7343-240-8
- Data Source: Typed+Crawled
- Character Count: 67279783
- Word Count: 6370954
- Release Date: 04-Apr-2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.